ABSTRACT. Feature selection with specific multivariate performance measures is the key
to the success of many applications such as information retrieval. In this paper, we
propose a feature selection method for multivariate performance measures. The proposed
method forms an optimization problem with an exponential number of both feature groups and
label configurations for a given dataset. To address this problem, a two-layer cutting plane
algorithm is proposed. The outer layer performs group feature generation, while the inner
layer learns the label configuration for multivariate performance measures. Comprehensive
experiments on large-scale and high-dimensional real-world datasets show that the
proposed method can significantly outperform $\ell_1$-SVM and SVM-RFE when choosing a
small subset of features, and achieves significantly improved performance over $\mathrm{SVM}^{perf}$
in terms of $F_1$-score. It also learns a sparse yet effective decision rule for multivariate
performance measures.

1. Introduction 

Feature selection is crucial to many applications such as text mining, image retrieval
and bioinformatics. These applications usually involve a huge number of features, which
incur very high computational costs for analysis. Pruning non-informative or noisy features
can usually improve the generalization performance. Moreover, a small set of features
makes the results easier to visualize and interpret. One of the most widely used criteria for
feature selection is the maximum margin criterion, particularly with the Support Vector Machine
(SVM). The weights of the SVM model can be used for feature selection in two ways.
One way is to induce sparsity in the weights by replacing $\ell_2$-norm regularization with
the $\ell_1$-norm [28, 5, 13]. Recently, Yuan et al. [24] conducted a thorough study comparing
several recently developed $\ell_1$-regularized algorithms. According to their study, the coordinate descent
method using a one-dimensional Newton direction (CDN) achieves state-of-the-art
performance on $\ell_1$-regularized problems.

To achieve a sparser solution, the Approximation of the zeRO norm Minimization
(AROM) was proposed [21]. The resulting problem is non-convex, so it suffers from local
optima. Despite this, recent results [12] and theoretical studies [26] have shown that
$\ell_p$ models ($p < 1$), even with a locally optimal solution, achieve better parameter estimation
performance than convex $\ell_1$ models, which are asymptotically biased [12]. Chan et al.
also proposed two convex relaxations of the $\ell_0$-SVM, but they are computationally expensive,
especially for high-dimensional datasets. The other way is to sort the weights and iteratively
remove the features with the smallest weights, as in SVM-Recursive Feature Elimination (SVM-RFE) [6].
However, as discussed in [23], such a nested "monotonic" feature selection scheme leads to
suboptimal performance. Non-monotonic feature selection (NMMKL) [23] was proposed
to solve this problem, but because each feature corresponds to one kernel, NMMKL is
infeasible for high-dimensional problems. Recently, Tan et al. [18] proposed the Feature
Generating Machine (FGM), which scales well for non-monotonic feature selection
on large-scale and very high-dimensional datasets. However, since FGM is formulated for
the 0-1 loss function, it is not appropriate for applications that require other performance measures.

Depending on the application, specific performance measures are usually required to evaluate
the success of a learning algorithm. In text classification, for example, $F_1$-score
and Precision/Recall Breakeven Point (PRBEP) are used to evaluate classifier performance,
while error rate is not suitable due to a large imbalance between positive and negative
examples [8]. To this end, $\mathrm{SVM}^{perf}$ [8] was proposed for multivariate performance measures.
As shown in [8], optimizing the learning model with respect to the specific multivariate
performance measure can boost the corresponding performance. However, for high-dimensional
data, as in image and document retrieval, feature selection is necessary since
noisy or non-informative features may degrade the performance measures. Moreover, feature
selection significantly speeds up prediction on high-dimensional data in real-world
retrieval applications.

In this paper, we propose to extend FGM to multivariate performance measures.
By transplanting a 0-1 control variable associated with each feature into the multivariate
prediction framework [8], we derive a modified FGM for multivariate performance
measures, namely $\mathrm{FGM}^{perf}$. Note that the resulting optimization problem is more complicated
than that of FGM due to the exponential number of both feature subsets and label
configurations for all examples. In this situation, existing Multiple Kernel Learning (MKL)
algorithms cannot be used to solve the MKL problem inside FGM because
of the exponential number of constraints in the primal form, or of optimization variables in
the dual form.

To this end, we propose a two-layer cutting plane algorithm: the outer layer performs
group feature generation, while the inner one selects label configurations for multivariate
performance measures. Comprehensive experiments on several large-scale and very
high-dimensional real-world datasets show that the proposed method yields performance
comparable to state-of-the-art feature selection methods on the 0-1 loss, and outperforms
$\mathrm{SVM}^{perf}$ using all features in terms of multivariate performance measures.

In the rest of this paper, we denote the transpose of a vector/matrix by the superscript $T$
and the $\ell_p$ norm of a vector $\mathbf{v}$ by $\|\mathbf{v}\|_p$. The binary operator $\odot$ represents the elementwise product
between two vectors/matrices.



2. Multivariate Performance Measures 

A large class of multivariate performance measures, such as $F_1$-score, Recall@$k$ and
Precision/Recall Breakeven Point (PRBEP), are non-linear and multivariate. Their decision-theoretic
risks cannot be decomposed into expectations over individual examples, and are
difficult to optimize directly. To address this problem, [8] proposed to formulate learning as a
multivariate prediction of all examples in the dataset, based on the sparse approximation
algorithm for structural SVMs [20]. Suppose we are given a training
sample of input-output pairs $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)$ drawn from some fixed but
unknown probability distribution, with $\mathbf{x}_i \in \mathbb{R}^m$ and $y_i \in \{-1, +1\}$. The learning problem
is treated as a multivariate prediction problem by defining hypotheses $h$ that map a tuple
$\mathbf{x} \in \mathcal{X}$ of $n$ feature vectors $\mathbf{x} = (\mathbf{x}_1, \ldots, \mathbf{x}_n)$ to a tuple $\mathbf{y} \in \mathcal{Y}$ of $n$ labels $\mathbf{y} = (y_1, \ldots, y_n)$,
$h: \mathcal{X} \to \mathcal{Y}$, where $\mathcal{X} = \mathbb{R}^m \times \cdots \times \mathbb{R}^m$ and $\mathcal{Y} \subseteq \{-1, +1\}^n$ is the set of all admissible label
vectors. The linear discriminant function is then defined as follows,

(1) $h_{\mathbf{w}}(\mathbf{x}) = \arg\max_{\mathbf{y}' \in \mathcal{Y}} \left\{ \sum_{i=1}^{n} y'_i \mathbf{w}^T \mathbf{x}_i \right\}$,

where $\mathbf{w}$ is the weight vector. Multivariate loss functions can be easily incorporated
into the structural SVM in the one-slack formulation as follows,

(2) $\min_{\mathbf{w}, \xi \ge 0} \frac{1}{2}\|\mathbf{w}\|^2 + C\xi$

s.t. $\forall \mathbf{y}' \in \mathcal{Y} \setminus \mathbf{y}: \mathbf{w}^T \sum_{i=1}^{n} (y_i - y'_i)\mathbf{x}_i \ge \Delta(\mathbf{y}, \mathbf{y}') - \xi$,

where $\Delta(\mathbf{y}, \mathbf{y}')$ is some type of multivariate loss function. This optimization problem
is convex, but it has one constraint for each $\mathbf{y}' \in \mathcal{Y}$, and $\mathcal{Y}$ is of exponential size. However, the
problem can be solved in polynomial time by adopting the sparse approximation algorithm
of structural SVM.



3. Feature Generating Machine (FGM) 

The Feature Generating Machine (FGM) was proposed by [18] to learn a sparse solution to
SVM. The discriminant function of this sparse SVM model is represented as follows,

(3) $h_{\mathbf{w}}(\mathbf{x}) = \mathrm{sign}\left((\mathbf{w} \odot \mathbf{d})^T \mathbf{x}\right)$,

where $h_{\mathbf{w}}$ maps a single example $\mathbf{x} \in \mathbb{R}^m$ to a label in $\{-1, +1\}$, and $\mathbf{d} = [d_1, \ldots, d_m]^T$ is a vector of 0-1 control variables in the
domain $\mathcal{D} = \{\mathbf{d} \mid \sum_{j=1}^{m} d_j \le B, \; d_j \in \{0, 1\}, \; \forall j = 1, \ldots, m\}$. The parameter $B$ is a
budget that controls the sparsity of $\mathbf{d}$. For clarity, we call a feature configuration $\mathbf{d}^t \in \mathcal{D}$
a group. Namely, if the $i$th feature is selected into the group $\mathbf{d}^t$, then $d^t_i = 1$, otherwise $d^t_i = 0$.
The sparse representation of a group is the set of indices $i$ with $d^t_i = 1$. FGM then attempts to
learn this discriminant function and perform feature selection simultaneously by solving
the following optimization problem,

(4) $\min_{\mathbf{d} \in \mathcal{D}} \min_{\mathbf{w}, \boldsymbol{\xi}, \rho} \frac{1}{2}\|\mathbf{w}\|^2 + \frac{C}{2}\sum_{i=1}^{n} \xi_i^2 - \rho$

s.t. $y_i \mathbf{w}^T (\mathbf{x}_i \odot \mathbf{d}) \ge \rho - \xi_i, \; \forall i = 1, \ldots, n$,

where $\rho / \|\mathbf{w}\|$ is the margin separation. This is a mixed integer programming
problem, which is computationally expensive due to the exponential size of $\mathcal{D}$.
[18] proposed to solve this problem by a cutting plane algorithm which iteratively generates a pool of
the most violated feature subsets and combines them via an MKL algorithm.
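The masked discriminant function in (3) can be illustrated with a small sketch. This is only an illustration under stated assumptions: the helper names `fgm_predict` and `in_budget` are hypothetical, vectors are plain Python lists, and a score of exactly zero is mapped to +1, a convention the text does not specify.

```python
def fgm_predict(w, d, x):
    """Return sign((w ⊙ d)^T x) for a single example x, as in Eq. (3).

    w, d and x are equal-length lists; d holds 0-1 control variables.
    Assumption: a score of exactly 0 is mapped to +1.
    """
    score = sum(wj * dj * xj for wj, dj, xj in zip(w, d, x))
    return 1 if score >= 0 else -1


def in_budget(d, B):
    """Check that d lies in the domain D: binary entries and at most B ones."""
    return all(dj in (0, 1) for dj in d) and sum(d) <= B
```

With `d = [1, 0, 1]` the second feature is ignored entirely, so the prediction can differ from that of the unmasked weight vector.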



4. Feature Selection for Multivariate Performance Measures 

In this section, we describe the proposed method in detail. By a simple combination
of the discriminant functions in (1) and (3), we obtain a new discriminant function
$h_{\mathbf{w}}(\mathbf{x})$ as

(5) $h_{\mathbf{w}}(\mathbf{x}) = \arg\max_{\mathbf{y}' \in \mathcal{Y}} \left\{ \sum_{i=1}^{n} y'_i (\mathbf{w} \odot \mathbf{d})^T \mathbf{x}_i \right\}$.

To optimize the multivariate loss functions and learn a sparse feature representation, we
propose to solve the following problem,

(6) $\min_{\mathbf{d} \in \mathcal{D}} \min_{\mathbf{w}, \xi \ge 0} \frac{1}{2}\|\mathbf{w}\|_2^2 + C\xi$

s.t. $\forall \mathbf{y}' \in \mathcal{Y} \setminus \mathbf{y}: \mathbf{w}^T \frac{1}{n}\sum_{i=1}^{n} (y_i - y'_i)(\mathbf{x}_i \odot \mathbf{d}) \ge \bar{\Delta}(\mathbf{y}, \mathbf{y}') - \xi$,

where $\bar{\Delta}(\mathbf{y}, \mathbf{y}') = \frac{1}{n}\Delta(\mathbf{y}, \mathbf{y}')$ is the average multivariate loss. This problem is
even more difficult to solve due to the exponential size of both $\mathcal{D}$ and $\mathcal{Y}$. To this
end, we propose a two-layer cutting plane algorithm to solve it efficiently and effectively.
The two layers, group feature generation and group feature selection, are described in
Sections 4.1 and 4.2, respectively. The two-layer cutting plane algorithm is presented
in Sections 4.3 and 4.4.

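4.1. Group Feature Generation. This layer is similar to FGM [18] in that it generates a pool of
the most violated feature subsets, but the dual form of Problem (6) has an exponential number
of dual variables. The partial dual with respect to $\mathbf{w}, \xi$ can be obtained as follows,

$\max_{\boldsymbol{\alpha} \in \mathcal{A}} \; -\frac{1}{2}\sum_{\mathbf{y}'}\sum_{\mathbf{y}''} \alpha_{\mathbf{y}'}\alpha_{\mathbf{y}''} Q^{\mathbf{d}}_{\mathbf{y}',\mathbf{y}''} + \sum_{\mathbf{y}'} \alpha_{\mathbf{y}'} b_{\mathbf{y}'}$,

where $\boldsymbol{\alpha}$ is the vector of dual variables, $Q^{\mathbf{d}}_{\mathbf{y}',\mathbf{y}''} = \langle \mathbf{a}^{\mathbf{d}}_{\mathbf{y}'}, \mathbf{a}^{\mathbf{d}}_{\mathbf{y}''} \rangle$, $\mathbf{a}^{\mathbf{d}}_{\mathbf{y}'} = \frac{1}{n}\sum_{i=1}^{n}(y_i - y'_i)(\mathbf{x}_i \odot \mathbf{d})$,
$b_{\mathbf{y}'} = \frac{1}{n}\Delta(\mathbf{y}, \mathbf{y}')$, and $\mathcal{A} = \{\boldsymbol{\alpha} \mid \sum_{\mathbf{y}'} \alpha_{\mathbf{y}'} \le C, \; \boldsymbol{\alpha} \ge 0\}$. By denoting
$S(\boldsymbol{\alpha}, \mathbf{d}) = -\frac{1}{2}\sum_{\mathbf{y}'}\sum_{\mathbf{y}''} \alpha_{\mathbf{y}'}\alpha_{\mathbf{y}''} Q^{\mathbf{d}}_{\mathbf{y}',\mathbf{y}''} + \sum_{\mathbf{y}'} \alpha_{\mathbf{y}'} b_{\mathbf{y}'}$, Problem (6) becomes $\min_{\mathbf{d} \in \mathcal{D}} \max_{\boldsymbol{\alpha} \in \mathcal{A}} S(\boldsymbol{\alpha}, \mathbf{d})$.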

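Following [11, 18], we introduce a mild convex relaxation of the above problem. According
to the minimax inequality [10], we can interchange the min and max to obtain a
lower-bounding problem, $\max_{\boldsymbol{\alpha} \in \mathcal{A}} \min_{\mathbf{d} \in \mathcal{D}} S(\boldsymbol{\alpha}, \mathbf{d})$. We further denote $F_{\mathbf{d}}(\boldsymbol{\alpha}) = -S(\boldsymbol{\alpha}, \mathbf{d})$, then
we have

(7) $\min_{\boldsymbol{\alpha} \in \mathcal{A}} \max_{\mathbf{d} \in \mathcal{D}} F_{\mathbf{d}}(\boldsymbol{\alpha})$ or $\min_{\boldsymbol{\alpha} \in \mathcal{A}, \gamma} \gamma : \gamma \ge F_{\mathbf{d}}(\boldsymbol{\alpha}), \; \forall \mathbf{d} \in \mathcal{D}$.

Though there are an exponential number of $\mathbf{d}$'s in $\mathcal{D}$, fortunately only a few constraints in (7)
are active at optimality, and including only a subset of these constraints can usually lead
to a very tight approximation of the original optimization problem. The cutting plane algorithm
[9] can be used here to solve this problem. Since $\max_{\mathbf{d} \in \mathcal{D}} F_{\mathbf{d}}(\boldsymbol{\alpha}) \ge F_{\mathbf{d}^t}(\boldsymbol{\alpha}), \; \forall \mathbf{d}^t \in \mathcal{D}$,
a lower bound approximation of (7) can be obtained by $\max_{\mathbf{d} \in \mathcal{D}} F_{\mathbf{d}}(\boldsymbol{\alpha}) \ge \max_{t=1,\ldots,T} F_{\mathbf{d}^t}(\boldsymbol{\alpha})$.
Then we can minimize the lower bound of (7) by

(8) $\min_{\boldsymbol{\alpha} \in \mathcal{A}} \max_{t=1,\ldots,T} F_{\mathbf{d}^t}(\boldsymbol{\alpha})$ or $\min_{\boldsymbol{\alpha} \in \mathcal{A}, \gamma} \gamma : \gamma \ge F_{\mathbf{d}^t}(\boldsymbol{\alpha}), \; \forall t = 1, \ldots, T$.

As shown in [14], such a cutting plane algorithm can converge to a robust optimal solution within
tens of iterations with the exact worst-case analysis. Specifically, for a fixed $\boldsymbol{\alpha}^t$, the
worst-case analysis can be done by solving

(9) $\mathbf{d}^{t+1} = \arg\max_{\mathbf{d} \in \mathcal{D}} F_{\mathbf{d}}(\boldsymbol{\alpha}^t)$,

which is referred to as the group generation procedure. However, Problems (8) and (9)
cannot be solved directly due to the exponential size of $\boldsymbol{\alpha}$, where each entry of $\boldsymbol{\alpha}$ corresponds
to one configuration in $\mathcal{Y}$.

4.2. Group Feature Selection. By introducing dual variables $\boldsymbol{\mu} = [\mu_1, \mu_2, \ldots, \mu_T]^T \ge 0$,
we can transform (8) into an MKL problem as follows,

(10) $\max_{\boldsymbol{\alpha} \in \mathcal{A}} \min_{\boldsymbol{\mu} \in \mathcal{M}_T} \; -\frac{1}{2}\sum_{\mathbf{y}'}\sum_{\mathbf{y}''} \alpha_{\mathbf{y}'}\alpha_{\mathbf{y}''} \left( \sum_{t=1}^{T} \mu_t Q^{\mathbf{d}^t}_{\mathbf{y}',\mathbf{y}''} \right) + \sum_{\mathbf{y}'} \alpha_{\mathbf{y}'} b_{\mathbf{y}'}$,

where $\mathcal{M}_T = \{\boldsymbol{\mu} \mid \sum_{t=1}^{T} \mu_t = 1, \; \mu_t \ge 0\}$.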

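However, due to the exponential size of $\boldsymbol{\alpha}$, the complexity of Problem (10) remains. In
this case, state-of-the-art multiple kernel learning algorithms [11, 16, 22] no longer work.
The following proposition shows that we can indirectly solve Problem (10) in the
primal form.

Proposition 1. The primal form of Problem (10) is

(11) $\min_{\mathbf{w}_1, \ldots, \mathbf{w}_T, \xi \ge 0} \frac{1}{2}\left( \sum_{t=1}^{T} \|\mathbf{w}_t\|_2 \right)^2 + C\xi$

s.t. $\sum_{t=1}^{T} \langle \mathbf{w}_t, \mathbf{a}^t_{\mathbf{y}'} \rangle \ge b_{\mathbf{y}'} - \xi, \; \forall \mathbf{y}' \in \mathcal{Y} \setminus \mathbf{y}$,

where $\mathbf{a}^t_{\mathbf{y}'}$ denotes $\mathbf{a}^{\mathbf{d}}_{\mathbf{y}'}$ evaluated at $\mathbf{d} = \mathbf{d}^t$. According to the KKT conditions, the solution of (11) is

(12) $\mathbf{w}_t = \mu_t \sum_{\mathbf{y}'} \alpha_{\mathbf{y}'} \mathbf{a}^t_{\mathbf{y}'}$,

where $\mu_t$ is the dual value of the $t$th constraint of (8).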

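Here, we define the regularization term as $\Omega(\mathbf{w}) = \frac{1}{2}\left( \sum_{t=1}^{T} \|\mathbf{w}_t\|_2 \right)^2$ with
$\mathbf{w} = [\mathbf{w}_1, \ldots, \mathbf{w}_T]^T$ and the empirical risk function as

(13) $R_{emp}(\mathbf{w}) = \max\left( 0, \; \max_{\mathbf{y}' \in \mathcal{Y} \setminus \mathbf{y}} b_{\mathbf{y}'} - \sum_{t=1}^{T} \langle \mathbf{w}_t, \mathbf{a}^t_{\mathbf{y}'} \rangle \right)$,

which is a convex but non-smooth function w.r.t. $\mathbf{w}$. We can then apply the bundle method
[19] to solve this primal problem. Problem (11) is transformed as

$\min_{\mathbf{w}} J(\mathbf{w}) = \Omega(\mathbf{w}) + C R_{emp}(\mathbf{w})$.

Since $R_{emp}(\mathbf{w})$ is a convex function, its subgradient exists everywhere in its domain.
Suppose $\mathbf{w}^k$ is a point in the domain where $R_{emp}(\mathbf{w})$ is finite; we can formulate the lower bound
according to the definition of the subgradient,

$R_{emp}(\mathbf{w}) \ge R_{emp}(\mathbf{w}^k) + \langle \mathbf{w} - \mathbf{w}^k, \mathbf{p}^k \rangle$,

where $\mathbf{p}^k \in \partial_{\mathbf{w}} R_{emp}(\mathbf{w}^k)$ is a subgradient at $\mathbf{w}^k$. Given the subgradient sequence $\mathbf{p}^1, \mathbf{p}^2, \ldots, \mathbf{p}^K$,
a tighter lower bound for $R_{emp}(\mathbf{w})$ can be stated as follows,

$R_{emp}(\mathbf{w}) \ge R^K_{emp}(\mathbf{w}) = \max\left( 0, \; \max_{1 \le k \le K} \langle \mathbf{w}, \mathbf{p}^k \rangle + q^k \right)$,

where $q^k = R_{emp}(\mathbf{w}^k) - \langle \mathbf{w}^k, \mathbf{p}^k \rangle$. Following the bundle method [19], the criterion for
selecting the next point $\mathbf{w}^{K+1}$ is to solve the following optimization problem,

(14) $\min_{\mathbf{w}_1, \ldots, \mathbf{w}_T, \xi \ge 0} \frac{1}{2}\left( \sum_{t=1}^{T} \|\mathbf{w}_t\|_2 \right)^2 + C\xi$

s.t. $\xi \ge \langle \mathbf{w}, \mathbf{p}^k \rangle + q^k, \; \forall k = 1, \ldots, K$.

The following corollary shows that Problem (14) can be easily solved by QCQP solvers,
and that the number of variables is independent of the number of examples.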
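The piecewise-linear lower bound $R^K_{emp}$ used by the bundle method can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the helper names `dot`, `add_cut` and `lower_bound` are hypothetical, and the caller is assumed to supply the risk value and a subgradient at each query point.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def add_cut(cuts, w_k, risk_k, p_k):
    """Store the cutting plane <w, p_k> + q_k built at the point w_k.

    risk_k is R_emp(w_k) and p_k a subgradient of R_emp at w_k, so the
    offset is q_k = R_emp(w_k) - <w_k, p_k>.
    """
    q_k = risk_k - dot(w_k, p_k)
    cuts.append((p_k, q_k))


def lower_bound(cuts, w):
    """Evaluate R^K_emp(w) = max(0, max_k <w, p_k> + q_k)."""
    return max([0.0] + [dot(w, p_k) + q_k for p_k, q_k in cuts])
```

Each call to `add_cut` can only raise the bound, and the bound is tight at every point where a cut was built, which is what drives the gap in the $\epsilon$-optimality test towards zero.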

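Corollary 1. In terms of Proposition 1, the dual form of Problem (14) is

(15) $\max_{\boldsymbol{\alpha} \in \mathcal{A}_K} \max_{\theta} \; -\theta + \sum_{k=1}^{K} \alpha_k q^k$

s.t. $\frac{1}{2}\left\| \sum_{k=1}^{K} \alpha_k \mathbf{p}^k_t \right\|_2^2 - \theta \le 0, \; \forall t = 1, \ldots, T$,

where $\mathcal{A}_K = \{\boldsymbol{\alpha} \mid \sum_{k=1}^{K} \alpha_k \le C, \; \alpha_k \ge 0, \; \forall k = 1, \ldots, K\}$, which is a QCQP problem
with $T + 1$ constraints and $K + 1$ variables.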

Note that Problem (15) is similar to the Support Kernel Machine (SKM) [2], in which
multiple Gaussian kernels are built on random subsets of features, with varying widths.
However, our method automatically chooses the most violated subset of features as a
group instead of a subset of random features. Such random features lead to a local optimum,
while our method can guarantee the $\epsilon$-optimality stated in Theorem 1. However,
due to the combinatorial structure of $\mathcal{D}$ in Problem (6), the current model can only work for
linear kernels with different subsets of features. By setting $J_K(\mathbf{w}) = \Omega(\mathbf{w}) + C R^K_{emp}(\mathbf{w})$,
the $\epsilon$-optimal condition in Algorithm 1 is $\min_{0 \le k \le K} J(\mathbf{w}^k) - J_K(\mathbf{w}^K) \le \epsilon$.

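Algorithm 1 group_feature_selection

Input: $\mathbf{x} = (\mathbf{x}_1, \ldots, \mathbf{x}_n)$, $\mathbf{y} = (y_1, \ldots, y_n)$, an initial group set $\mathcal{W}$
$\mathcal{Y} = \emptyset$, $k = 0$
repeat
    $k = k + 1$
    Find the most violated $\mathbf{y}'$
    Compute $\mathbf{p}^k$ and $q^k$
    $\mathcal{Y} = \mathcal{Y} \cup \{\mathbf{y}'\}$
    Solve Problem (15) over $\mathcal{W}$ and $\mathcal{Y}$
until $\epsilon$-optimal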



4.3. Two-Layer Cutting Plane Algorithm. Algorithm 1 can obtain the $\epsilon$-optimal solution
of the original dual problem (8). By denoting $Q_{\mathbf{d}^t}(\boldsymbol{\alpha}) = \frac{1}{2}\left\| \sum_{k=1}^{K} \alpha_k \mathbf{p}^k_t \right\|_2^2 - \sum_{k=1}^{K} \alpha_k q^k$,
the group feature generation layer can directly use the $\epsilon$-optimal solution
of the objective $Q_{\mathbf{d}^t}(\boldsymbol{\alpha})$ to approximate the original objective $F_{\mathbf{d}^t}(\boldsymbol{\alpha})$. The two-layer cutting
plane algorithm is presented in Algorithm 2. From the description of Algorithm 2, it is
clear that groups are dynamically generated and added to the active set $\mathcal{W}$ for
group selection.

Algorithm 2 The Two-Layer Method

Input: $\mathbf{x} = (\mathbf{x}_1, \ldots, \mathbf{x}_n)$, $\mathbf{y} = (y_1, \ldots, y_n)$
$\mathcal{W} = \emptyset$, $t = 0$
repeat
    $t = t + 1$
    Find the most violated $\mathbf{d}^t$
    $\mathcal{W} = \mathcal{W} \cup \{\mathbf{d}^t\}$
    group_feature_selection($\mathbf{x}$, $\mathbf{y}$, $\mathcal{W}$)
until convergence
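The control flow of the two-layer method can be sketched as follows. This is a skeleton under explicit assumptions rather than the authors' MATLAB code: the two layers are passed in as hypothetical callables (`find_most_violated_d` and `group_feature_selection`), and the stopping rule, a small change in the inner objective, is only one plausible reading of "until convergence".

```python
def two_layer_cutting_plane(find_most_violated_d, group_feature_selection,
                            eps=1e-3, max_iter=50):
    """Outer loop of the two-layer method (sketch).

    find_most_violated_d(alpha) -> next group d^t (alpha is None at the start).
    group_feature_selection(W)  -> (alpha, objective) for the active set W.
    Assumption: stop when the objective changes by at most eps.
    """
    W = []            # active set of feature groups
    alpha = None      # dual variables of the inner problem
    prev_obj = None
    for _ in range(max_iter):
        d_t = find_most_violated_d(alpha)          # outer layer: group generation
        W.append(d_t)
        alpha, obj = group_feature_selection(W)    # inner layer: group selection
        if prev_obj is not None and abs(prev_obj - obj) <= eps:
            break
        prev_obj = obj
    return W, alpha
```

Passing the layers as callables keeps the sketch independent of any particular QCQP solver or loss function.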



Based on the convergence proof of FGM in [18], we can obtain the following theorem,
which characterizes the approximation of the original problem by an $\epsilon$-optimal solution.

Theorem 1. After Algorithm 2 stops in a finite number of steps, the difference between the optimal
solution $(\mathbf{d}^*, \boldsymbol{\alpha}^*)$ of Problem (10) and the solution $(\mathbf{d}, \boldsymbol{\alpha})$ of Algorithm 2 is
$S(\boldsymbol{\alpha}^*, \mathbf{d}^*) - Q_{\mathbf{d}}(\boldsymbol{\alpha}) \le \epsilon$.

4.4. Finding the Most Violated $\mathbf{y}'$ and $\mathbf{d}$. Algorithm 1 and Algorithm 2 need to find the
most violated $\mathbf{y}'$ and $\mathbf{d}$, respectively. In this subsection, we discuss how to obtain these
quantities efficiently.

Algorithm 1 needs to calculate a subgradient of the empirical risk function $R_{emp}(\mathbf{w})$.
Since $R_{emp}(\mathbf{w})$ is a pointwise supremum function, the subgradient lies in the convex
hull of the gradients of the decomposed functions with the largest objective. Here, we just
take one such subgradient by solving

(16) $\mathbf{y}^k = \arg\max_{\mathbf{y}' \in \mathcal{Y} \setminus \mathbf{y}} \Delta(\mathbf{y}', \mathbf{y}) - \sum_{i=1}^{n} (y_i - y'_i) v_i$,

where $v_i = \sum_{t=1}^{T} \mathbf{w}_t^T (\mathbf{x}_i \odot \mathbf{d}^t)$. After obtaining $\mathbf{y}^k$, it is easy to compute $\mathbf{p}^k_t = -\frac{1}{n}\sum_{i=1}^{n} (y_i - y^k_i)(\mathbf{x}_i \odot \mathbf{d}^t)$.

How the most violated $\mathbf{y}'$ is found depends on the definition of the loss $\Delta(\mathbf{y}, \mathbf{y}')$ in
Problem (10). One instance is the hamming loss, which can be decomposed and computed
independently for each example, $\Delta(\mathbf{y}, \mathbf{y}') = \sum_{i=1}^{n} \delta(y_i, y'_i)$, where $\delta$ is an indicator function with
$\delta(y_i, y'_i) = 0$ if $y_i = y'_i$, and 1 otherwise. However, some multivariate performance
measures cannot be decomposed in this way. Fortunately, there is a series of structured
loss functions, such as Area Under ROC (AUC), Average Precision (AP), ranking
and contingency table scores and other measures listed in [8, 25, 19], which can be
implemented efficiently in our algorithms. In this paper, we use only several multivariate
performance measures based on the contingency table as showcases, for which finding $\mathbf{y}^k$
can be done in time complexity $O(n^2)$ [8].
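For the hamming loss, the objective in (16) separates over examples, which the following sketch makes concrete. It is an illustration only: the function names are hypothetical, and for simplicity the search ranges over all label vectors, so the exclusion of $\mathbf{y}' = \mathbf{y}$ required in (16) is not enforced. Flipping label $i$ changes the objective by $1 - 2 y_i v_i$, so a label is flipped exactly when this quantity is positive.

```python
from itertools import product


def hamming_loss(y, y_prime):
    """Number of positions where the two label vectors disagree."""
    return sum(1 for yi, ypi in zip(y, y_prime) if yi != ypi)


def objective(y, y_prime, v):
    """Objective of Eq. (16) with hamming loss: Delta(y', y) - sum_i (y_i - y'_i) v_i."""
    return hamming_loss(y, y_prime) - sum(
        (yi - ypi) * vi for yi, ypi, vi in zip(y, y_prime, v))


def most_violated_hamming(y, v):
    """Maximize the objective coordinate-wise: flip y_i iff 1 - 2*y_i*v_i > 0."""
    return [-yi if 1 - 2 * yi * vi > 0 else yi for yi, vi in zip(y, v)]


def brute_force_value(y, v):
    """Exhaustive maximum over all 2^n label vectors (for checking small cases)."""
    return max(objective(y, list(c), v) for c in product([-1, 1], repeat=len(y)))
```

The coordinate-wise rule takes linear time, in contrast to the $O(n^2)$ search needed for the contingency-table measures.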

Given the true labels $\mathbf{y}$ and predicted labels $\mathbf{y}'$, the contingency table is defined as
follows

          y = 1    y = -1
y' = 1      a        b
y' = -1     c        d

$F_\beta$-score: The $F_\beta$-score is a weighted harmonic average of Precision and Recall.
According to the contingency table, we can obtain

$F_\beta = \frac{(1 + \beta^2) a}{(1 + \beta^2) a + b + \beta^2 c}$.

Table 1. Datasets used in our experiments

Dataset          #classes   #features    #train. points   #test points
News20.binary    2          1,355,191    11,997           7,999
URL1             2          3,231,961    20,000           20,000
Image            5          10,800       1,200            800
Rcv1             53         47,236       15,564           518,571
Sector           105        55,197       6,412            3,207
News20           20         62,061       15,935           3,993

The most common choice is $\beta = 1$. The corresponding balanced $F_1$ measure loss can be
written as $\Delta_{F_1}(a, b, c, d) = 100(1 - F_1)$. Then, Algorithm 2 in [8] can be directly applied.
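The contingency-table quantities and the resulting $F_\beta$ loss can be computed as in the sketch below. The function names are illustrative, labels are assumed to take values in $\{-1, +1\}$, and returning 0 when the denominator vanishes is a convention chosen here rather than one stated in the text.

```python
def contingency(y_true, y_pred):
    """Return (a, b, c, d): true positives, false positives,
    false negatives and true negatives for labels in {-1, +1}."""
    a = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    b = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == 1)
    c = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == -1)
    d = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == -1)
    return a, b, c, d


def f_beta(a, b, c, beta=1.0):
    """F_beta = (1 + beta^2) a / ((1 + beta^2) a + b + beta^2 c).

    Assumption: the score is defined as 0 when the denominator is 0.
    """
    denom = (1 + beta ** 2) * a + b + beta ** 2 * c
    return 0.0 if denom == 0 else (1 + beta ** 2) * a / denom


def f1_loss(a, b, c):
    """Balanced F_1 loss, 100 * (1 - F_1)."""
    return 100.0 * (1.0 - f_beta(a, b, c, beta=1.0))
```

Note that the true-negative count `d` does not enter $F_\beta$, which is one reason the measure suits imbalanced data.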

Precision/Recall@$k$: In search engine systems, most users scan only the first few links
that are presented. In this situation, Prec@$k$ and Rec@$k$ measure the precision and recall
of a classifier that predicts exactly $k$ documents as positive, i.e.,

Prec@$k$ $= \frac{a}{a + b}$, Rec@$k$ $= \frac{a}{a + c}$,

subject to $a + b = k$. The corresponding losses can be defined as $\Delta_{Prec@k} = 100(1 - \mathrm{Prec@}k)$ and
$\Delta_{Rec@k} = 100(1 - \mathrm{Rec@}k)$. The procedure for finding the most violated $\mathbf{y}$ is
similar to that for the $F$-score; the only difference is that configurations satisfying $a + b = k$ are kept and those with
$a + b \ne k$ are removed. In the evaluation, we label the $k$ examples with the largest decision values as 1, and then
calculate the values of Prec@$k$ and Rec@$k$.
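The evaluation step described above, labelling the $k$ highest-scoring examples as positive, can be sketched as follows. The helper names are hypothetical, ties in the scores are broken by original index (an assumption), and at least one positive example is assumed to exist.

```python
def top_k_labels(scores, k):
    """Label the k examples with the largest decision values +1, the rest -1.

    Ties are broken by original index because Python's sort is stable.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top = set(order[:k])
    return [1 if i in top else -1 for i in range(len(scores))]


def prec_rec_at_k(y_true, scores, k):
    """Return (Prec@k, Rec@k) = (a / (a + b), a / (a + c)) with a + b = k."""
    y_pred = top_k_labels(scores, k)
    a = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    n_pos = sum(1 for t in y_true if t == 1)   # equals a + c
    return a / k, a / n_pos
```

Setting `k` to twice the number of positive examples gives the Rec@2p measure used in the document retrieval experiments.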

Precision/Recall Break-Even Point: The Precision/Recall Break-Even Point (PRBEP) is the value at which
precision and recall are equal. From the definitions above, we can see that PRBEP only
adds the constraint $a + b = a + c$, or $b = c$. The corresponding loss can be defined as
$\Delta_{PRBEP} = 100(1 - \mathrm{PRBEP})$. Finding the most violated $\mathbf{y}$ should enforce the constraint
$b = c$.
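One common way to evaluate PRBEP, assumed here rather than taken from the text, is to predict exactly as many positives as there are true positives in the data, which forces $b = c$. The function name `prbep` is illustrative and at least one positive example is assumed.

```python
def prbep(y_true, scores):
    """Precision (= recall) when exactly n_pos examples are predicted positive.

    Predicting n_pos positives gives a + b = a + c, hence b = c.
    """
    n_pos = sum(1 for t in y_true if t == 1)
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    a = sum(1 for i in order[:n_pos] if y_true[i] == 1)
    return a / n_pos
```

The corresponding loss is then `100 * (1 - prbep(y_true, scores))`.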

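Now, we can reduce $\boldsymbol{\alpha}$ in Problem (9) from exponential size to size $K$. Then finding
the most violated $\mathbf{d}$ in Algorithm 2 becomes

(17) $\mathbf{d}^{t+1} = \arg\max_{\mathbf{d} \in \mathcal{D}} Q_{\mathbf{d}}(\boldsymbol{\alpha}^t)$

$= \arg\max_{\mathbf{d} \in \mathcal{D}} \frac{1}{2}\left\| \frac{1}{n}\sum_{k=1}^{K} \alpha^t_k \sum_{i=1}^{n} (y_i - y^k_i)(\mathbf{x}_i \odot \mathbf{d}) \right\|^2$

$= \arg\max_{\mathbf{d} \in \mathcal{D}} \frac{1}{2n^2}\sum_{j=1}^{m} c_j^2 d_j$,

where $c_j = \sum_{k=1}^{K} \alpha^t_k \sum_{i=1}^{n} (y_i - y^k_i) x_{i,j}$. With the budget constraint $\sum_{j=1}^{m} d_j \le B$ in
$\mathcal{D}$, (17) can be solved by first sorting the $c_j^2$'s in descending order and then setting the
$d_j$ corresponding to the first $B$ values to 1 and the rest to 0. This takes only $O(m \log m)$ operations.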
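The sorting step that solves (17) can be sketched directly. This is a minimal dense-list illustration with hypothetical function names; a practical implementation would presumably exploit sparsity in the data.

```python
def feature_scores(X, y, Y_k, alpha):
    """c_j = sum_k alpha_k * sum_i (y_i - y^k_i) * x_{i,j}.

    X is a list of n feature lists, y the true labels, Y_k the list of
    K label configurations y^k, and alpha the K dual weights.
    """
    m = len(X[0])
    c = [0.0] * m
    for a_k, y_k in zip(alpha, Y_k):
        for x_i, y_i, yk_i in zip(X, y, y_k):
            diff = y_i - yk_i
            if diff != 0:              # examples labelled correctly contribute nothing
                for j in range(m):
                    c[j] += a_k * diff * x_i[j]
    return c


def most_violated_group(c, B):
    """Set d_j = 1 for the B features with the largest c_j^2, else 0."""
    order = sorted(range(len(c)), key=lambda j: c[j] ** 2, reverse=True)
    d = [0] * len(c)
    for j in order[:B]:
        d[j] = 1
    return d
```

For example, `most_violated_group([0.5, -3.0, 2.0, 0.0], 2)` selects the second and third features, since their squared scores 9 and 4 are the two largest.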

5. Experiments 

In this section, we conduct extensive experiments to evaluate the performance of our
proposed method and state-of-the-art feature selection methods: 1) SVM-RFE [6]; 2) $\ell_1$-SVM;
3) FGM [18]. SVM-RFE and FGM use the Liblinear software as the QP solver for
their SVM subproblems. For $\ell_1$-SVM, we also use the Liblinear software, which implements
the state-of-the-art $\ell_1$-SVM algorithm [24].

For convenience, we name our proposed two-layer cutting plane algorithm $\mathrm{FGM}^{\Delta}_{multi}$,
where $\Delta$ represents the type of multivariate performance measure. We implemented
Algorithm 2 in MATLAB for all the multivariate performance measures listed above, using
Mosek as the QCQP solver for Problem (15), which yields a worst-case complexity of
$O(KT^2)$. Since the values of both $K$ and $T$ are much smaller than the number of examples
$n$ and the dimensionality $m$, the QCQP is very efficient as well as more accurate for
large-scale and high-dimensional datasets. Furthermore, the codes simultaneously solve
the primal and its dual form, so the optimal $\boldsymbol{\mu}$ and $\boldsymbol{\alpha}$ can be obtained after solving Problem (15).

For a test pattern $\mathbf{x}$, the discriminant function can be obtained by $f(\mathbf{x}) = \langle \mathbf{w} \odot \mathbf{d}, \mathbf{x} \rangle$,
where $\mathbf{w} = \sum_{i=1}^{n} \beta_i \mathbf{x}_i$, $\beta_i = \frac{1}{n}\sum_{k=1}^{K} \alpha_k (y_i - y^k_i)$, and $\mathbf{d} = \sum_{t=1}^{T} \mu_t \mathbf{d}^t$. This leads to
faster prediction since only a few selected features are involved. After computing
$\mathbf{p}^k$, the matrices of Problem (15) can be incrementally updated, so it can be done in a total of
$O(TK^2)$.
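The speed-up at prediction time comes from touching only the features with non-zero weight in $\mathbf{d}$. A minimal sketch, assuming dense input lists and hypothetical helper names, stores the non-zero coefficients of $\mathbf{w} \odot \mathbf{d}$ once and reuses them for every test pattern.

```python
def build_sparse_model(w, d):
    """Keep only the coefficients (w ⊙ d)_j for features with d_j != 0."""
    return {j: wj * dj
            for j, (wj, dj) in enumerate(zip(w, d))
            if dj != 0 and wj != 0}


def decision_value(model, x):
    """f(x) = <w ⊙ d, x>, summed over the selected features only."""
    return sum(coef * x[j] for j, coef in model.items())
```

The cost of `decision_value` depends on the number of selected features rather than on the full dimensionality $m$.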

5.1. Feature Selection for Accuracy. It has been proven in [8] that $\mathrm{SVM}^{\Delta}_{multi}$ with hamming
loss, namely $\Delta_{Err}(\mathbf{y}, \mathbf{y}') = 2(b + c)$, is the same as SVM. In this subsection, we
therefore first evaluate the accuracy of $\mathrm{FGM}^{\Delta}_{multi}$ with the hamming loss function, namely
$\mathrm{FGM}^{hamming}_{multi}$, as well as other state-of-the-art feature selection methods. We compare
these methods on two binary datasets, News20.binary and URL1, in Table 1. These two
datasets are used in Ell. We use the standard training and test sets to evaluate the predic- 
tion performance of feature selection methods in this experiment. 

We test FGM and SVM-RFE over the grid $C_{FGM} = [0.001, 0.01, 0.1, 1, 5, 10]$ and choose
$C_{FGM} = 5$, which gives good performance for both FGM and SVM-RFE. This is the
same as in [18]. For $\mathrm{FGM}^{hamming}_{multi}$, we fix $C_{FGM_{multi}}$ as $0.1 \times n$
for URL1 and $1.0 \times n$ for News20.binary. The budget parameter is set to $B =
[2, 5, 10, 50, 100, 150, 200, 250]$ for News20.binary, and $B = [2, 5, 10, 20, 30, 40, 50, 60]$
for URL1. The feature elimination scheme for the SVM-RFE method follows
[18]. For $\ell_1$-SVM, we report the results for different $C$ values so as to obtain different
numbers of selected features.

Figure 1 reports the testing accuracy on different datasets. The testing accuracy is
comparable among different methods, but both $\mathrm{FGM}^{hamming}_{multi}$ and FGM obtain better
prediction performance than SVM-RFE with a small number (fewer than 20) of selected features
on both News20.binary and URL1. These results show that the proposed method with
hamming loss works well on feature selection tasks, especially when choosing only a
few features. $\mathrm{FGM}^{hamming}_{multi}$ also performs better than $\ell_1$-SVM on News20.binary over most
of the range of selected features. This is possibly because $\ell_1$ models are more sensitive to noisy
or redundant features on the News20.binary dataset.

Figure 2 shows that our method with a small $B$ selects fewer features
than with a large $B$. We also observe that most of the features selected with a small $B$ also
appear in the subset of features selected with a large $B$. This phenomenon is most clearly
observed on News20.binary. This suggests that $\mathrm{FGM}^{hamming}_{multi}$ can select the
important features in the given datasets, since the selection is insensitive to the parameter $B$. However,
we notice that not all the features selected with a smaller $B$ fall
into the subset selected with a larger $B$, so our method performs non-monotonic feature
selection. This argument is consistent with the test accuracy in Figure 1. News20.binary
seems to be a monotonic dataset from Figure 2, since $\mathrm{FGM}^{hamming}_{multi}$, FGM and SVM-RFE
demonstrate similar performance. However, URL1 is more likely to be non-monotonic, as
our method and FGM do better than SVM-RFE. All these facts imply that the proposed
method is comparable with FGM and SVM-RFE, while also demonstrating the
non-monotonic property for feature selection.




Figure 1. Testing accuracy on different datasets.

[Figure 2 panels: (a) News20.binary, (b) URL1; x-axis: the sorted index of selected features.]

Figure 2. The sparsity of features of $\mathrm{FGM}^{hamming}_{multi}$ with varying $B$
on different datasets. Each row bar with a different color represents the
subset of features selected under the current $B$, where the white
region means the features are not selected.



5.2. Feature Selection for Image Retrieval. In this subsection, we demonstrate that
specific multivariate performance measures are important for selecting features in real
applications. In particular, we evaluate the $F_1$ measure (a commonly used performance measure) for
the task of image retrieval. Due to the success of transforming multiple instance learning
into a feature selection problem by embedded instance selection, we use the same strategy
as Algorithm 4.1 of [0] to construct a dense and high-dimensional dataset from
preprocessed image data. This dataset is used in [27] for multi-instance learning. It contains
five categories and 2,000 images; each image is represented as a bag of nine instances
generated by the SBN method [13]. Each image bag is represented by a collection of nine
15-dimensional feature vectors. After that, following [2], the natural scene image retrieval
problem becomes a feature selection task of selecting relevant embedded instances for
prediction. The Image dataset is split randomly, with 60% for training
and 40% for testing (Table 1). Since $F_1$-score is used as the performance metric, we run
$\mathrm{FGM}^{\Delta}_{multi}$ for $F_1$-score, namely $\mathrm{FGM}^{F_1}_{multi}$, as well as other state-of-the-art feature
selection methods. As mentioned above, FGM and $\mathrm{FGM}^{hamming}_{multi}$ have similar performance, so
we do not report the results of FGM here. $\mathrm{FGM}^{hamming}_{multi}$ and $\mathrm{FGM}^{F_1}_{multi}$ use the fixed
$C = 10 \times n$. For other methods, we use the previous settings. The testing $F_1$ values of all
methods on each category are reported in Figure 3.

Figure 3. Testing $F_1$ scores on the Image dataset.






Q. MAO AND IVOR W. TSANG 



Table 2. Macro-average testing performance of different methods. The quantities in
parentheses are the won/lost counts of the current method compared with $\mathrm{FGM}^{\Delta}_{multi}$.
The last column gives the average number of features actually used by the current
method for a specific measure.

Dataset   Method                          $F_1$            Rec@2p           PRBEP            #selected features
Rcv1      $\mathrm{FGM}^{\Delta}_{multi}$      42.68            67.81            58.01            690.4 / 673.7 / 547.8
          $\mathrm{FGM}^{hamming}_{multi}$     26.81 (0/51)     36.19 (0/50)     31.30 (0/49)     451.6
          $\mathrm{SVM}^{\Delta}_{multi}$      26.55 (5/46)     63.05 (24/26)    55.48 (15/35)    47,236
Sector    $\mathrm{FGM}^{\Delta}_{multi}$      92.07            95.77            93.25            787.6 / 658.9 / 508.3
          $\mathrm{FGM}^{hamming}_{multi}$     84.99 (12/91)    90.01 (0/71)     85.54 (0/86)     689.2
          $\mathrm{SVM}^{\Delta}_{multi}$      33.35 (1/104)    95.52 (11/19)    91.24 (11/47)    55,197
News20    $\mathrm{FGM}^{\Delta}_{multi}$      77.56            91.21            81.46            1,301 / 1,186 / 931
          $\mathrm{FGM}^{hamming}_{multi}$     49.61 (0/20)     66.32 (0/20)     52.14 (0/20)     485.1
          $\mathrm{SVM}^{\Delta}_{multi}$      55.53 (0/20)     93.08 (16/2)     80.83 (6/11)     62,061

From Figure 3, we observe that $\mathrm{FGM}^{F_1}_{multi}$ and $\mathrm{FGM}^{hamming}_{multi}$ achieve significantly
improved performance over $\ell_1$ methods in terms of $F_1$-score, especially when choosing fewer
than 100 features. Moreover, SVM-RFE also outperforms $\ell_1$ methods on three categories
out of five. This suggests that the $\ell_1$ penalty does not perform as well as $\ell_0$ methods like
$\mathrm{FGM}^{F_1}_{multi}$ and $\mathrm{FGM}^{hamming}_{multi}$ on dense and high-dimensional datasets, possibly
because the $\ell_1$-norm penalty is very sensitive to dense and noisy features. We also observe that
$\mathrm{FGM}^{F_1}_{multi}$ performs better than $\mathrm{FGM}^{hamming}_{multi}$ and SVM-RFE on four out of five categories.
All these facts imply that directly optimizing the $F_1$ measure is useful for boosting $F_1$
performance.

5.3. Multivariate Performance Measures for Document Retrieval. In this subsection,
we focus on feature selection for different multivariate performance measures on the
imbalanced text data shown in Table 1. For multiclass classification problems, the one-vs-rest
strategy is used. The baseline for comparison is $\mathrm{SVM}^{perf}$. Following [8], we use the
notation $\mathrm{SVM}^{\Delta}_{multi}$ for different multivariate performance measures. The command used
for training $\mathrm{SVM}^{perf}$ can work for different measures via the -l option. In our experiments,
we search $C_{perf}$ in the same range $[2^{-6}, \ldots, 2^{6}]$ as in [8]. For each multivariate performance
measure, we choose the value which gives the best performance of $\mathrm{SVM}^{\Delta}_{multi}$
for comparison. $\mathrm{FGM}^{\Delta}_{multi}$ and $\mathrm{FGM}^{hamming}_{multi}$ fix $C_{FGM_{multi}} = 0.1 \times n$ for Rcv1 and
News20, and $1.0 \times n$ for Sector. For Rec@$k$, we use $k$ as twice the number of positive
examples, namely Rec@2p. The evaluation for this measure uses the same strategy: we label
twice the number of positive examples as positive in the test datasets, and then calculate
Rec@2p.

Table 2 shows the macro-average of the performance over all classes in a collection, in
which both $\mathrm{FGM}^{\Delta}_{multi}$ and $\mathrm{FGM}^{hamming}_{multi}$ are listed with $B = 250$ only. The improvement
of $\mathrm{FGM}^{\Delta}_{multi}$ over $\mathrm{FGM}^{hamming}_{multi}$ and $\mathrm{SVM}^{\Delta}_{multi}$ with respect to different $B$ values is
reported in Figure 4. From Table 2, $\mathrm{FGM}^{\Delta}_{multi}$ is consistently better than $\mathrm{FGM}^{hamming}_{multi}$ on
all multivariate performance measures and all three multiclass datasets. Similar results are
obtained when comparing with $\mathrm{SVM}^{\Delta}_{multi}$; the only exception is the measure Rec@2p on
News20, where $\mathrm{SVM}^{\Delta}_{multi}$ is slightly better than $\mathrm{FGM}^{\Delta}_{multi}$. The largest gains are observed
for the $F_1$ score on all three text classification tasks. This implies that a small number of
features selected by $\mathrm{FGM}^{\Delta}_{multi}$ is enough to obtain comparable or even better performance
on different measures than $\mathrm{SVM}^{\Delta}_{multi}$ using all features.

www.cs.cornell.edu/People/tj/svm_light/svm_perf.html
svm_perf_learn -c C_perf -w 3 -b train.file train_model

Figure 4. The average performance improvement of $\mathrm{FGM}^{\Delta}_{multi}$ with
varying $B$ on different datasets.

From Figure 4, $\mathrm{FGM}^{\Delta}_{multi}$ consistently performs better than $\mathrm{FGM}^{hamming}_{multi}$ for all of
the multivariate performance measures, as shown by the figures on the left-hand side. Moreover,
the figures on the right-hand side show that a small number of features is good for the $F_1$
measure, but poor for the other measures. As the number of features increases, Rec@2p and
PRBEP approach the results of $\mathrm{SVM}^{\Delta}_{multi}$ and all curves become flat. The
performance of PRBEP and Rec@2p is relatively stable when sufficient features are selected,
but our method can choose very few features for fast prediction. For the $F_1$ measure, our
method is consistently better than $\mathrm{SVM}^{\Delta}_{multi}$, and the results show significant improvement
over the whole range of $B$. This improvement may be due to the reduction of noisy or
non-informative features. Furthermore, $\mathrm{FGM}^{\Delta}_{multi}$ achieves better performance
than $\mathrm{FGM}^{hamming}_{multi}$.

6. Conclusion 

Learning algorithms need application-specific performance measures to evaluate their
success. Due to the high dimensionality of the data in many applications, the performance of SVM for
multivariate performance measures on the full set of features may degrade. In this paper, we propose a
learning framework that trains the SVM model for multivariate performance measures and performs
feature selection at the same time. To solve the resulting optimization problem, a two-layer cutting
plane algorithm is proposed. Experimental results show that the proposed method is
comparable with FGM and SVM-RFE and better than $\ell_1$ models on the feature selection task,
and outperforms SVM for multivariate performance measures on the full set of features.